{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "**The main goal of this tutorial is to highlight the pre-processing steps that every data analysis should go through. It is easy to forget the Python syntax for these steps, so the datasets here were chosen for convenience, simply to produce the output that illustrates each one. I hope this tutorial serves as a quick reference for everyone.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:29:29.604871Z",
     "start_time": "2019-12-15T19:29:29.519838Z"
    },
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "from pandas import *\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib notebook\n",
    "from scipy.stats import mode\n",
    "\n",
    "import os\n",
    "print(os.listdir(\"../input\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:29:47.797949Z",
     "start_time": "2019-12-15T19:29:47.426636Z"
    },
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
   },
   "outputs": [],
   "source": [
    "diabetes = pd.read_csv('../input/pima-diabetes/diabetes.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:29:51.061293Z",
     "start_time": "2019-12-15T19:29:50.967667Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Data exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "#### Check that the data types are as expected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:31:21.458699Z",
     "start_time": "2019-12-15T19:31:21.431864Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Memory optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:32:07.311699Z",
     "start_time": "2019-12-15T19:32:07.272143Z"
    }
   },
   "outputs": [],
   "source": [
    "memory = diabetes.memory_usage()\n",
    "print(memory)\n",
    "print(\"Total Memory Usage = \",sum(memory))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:34:04.792252Z",
     "start_time": "2019-12-15T19:34:04.76643Z"
    }
   },
   "outputs": [],
   "source": [
    "# Reassign the frame (iloc assignment can keep the old dtypes); float32 avoids float16's ~3-digit precision\n",
    "diabetes = diabetes.astype('float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:34:07.063785Z",
     "start_time": "2019-12-15T19:34:07.034318Z"
    }
   },
   "outputs": [],
   "source": [
    "memory = diabetes.memory_usage()\n",
    "print(memory)\n",
    "print(\"Total Memory Usage = \",sum(memory))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "## Check summary statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:35:19.725292Z",
     "start_time": "2019-12-15T19:35:19.554923Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diabetes[\"Outcome\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "## Check for outliers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:35:39.587952Z",
     "start_time": "2019-12-15T19:35:38.839115Z"
    }
   },
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "sns.boxplot(data=diabetes,orient='h',palette=\"Set2\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Handle outliers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:36:35.395439Z",
     "start_time": "2019-12-15T19:36:35.352932Z"
    }
   },
   "outputs": [],
   "source": [
    "q75, q25 = np.percentile(diabetes[\"Insulin\"], [75 ,25])\n",
    "iqr = q75-q25\n",
    "print(\"IQR\",iqr)\n",
    "whisker = q75 + (1.5*iqr)\n",
    "print(\"Upper whisker\",whisker)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:36:59.725422Z",
     "start_time": "2019-12-15T19:36:59.712013Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes[\"Insulin\"] = diabetes[\"Insulin\"].clip(upper=whisker)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:37:03.68992Z",
     "start_time": "2019-12-15T19:37:02.945977Z"
    }
   },
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "sns.boxplot(data=diabetes,orient='h',palette=\"Set2\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Check for missing values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:37:28.227236Z",
     "start_time": "2019-12-15T19:37:28.144997Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:39:26.214967Z",
     "start_time": "2019-12-15T20:39:26.1829Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:37:53.9697Z",
     "start_time": "2019-12-15T19:37:53.93697Z"
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "print((diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',\n",
    "       'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:38:07.882015Z",
     "start_time": "2019-12-15T19:38:07.77646Z"
    }
   },
   "outputs": [],
   "source": [
    "# This tutorial treats zeros in these columns as missing values and recodes them as NaN\n",
    "zero_as_missing = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']\n",
    "diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)\n",
    "diabetes.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:38:17.688382Z",
     "start_time": "2019-12-15T19:38:17.659868Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "## Handle missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "## A. Drop rows that contain NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:39:20.722474Z",
     "start_time": "2019-12-15T19:39:20.671333Z"
    }
   },
   "outputs": [],
   "source": [
    "print(\"Size before dropping NaN rows\",diabetes.shape,\"\\n\")\n",
    "\n",
    "nan_dropped = diabetes.dropna()\n",
    "\n",
    "print(nan_dropped.isnull().sum())\n",
    "print(\"\\nSize after dropping NaN rows\",nan_dropped.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "## Drop rows / columns with more than a given percentage of NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:40:14.89074Z",
     "start_time": "2019-12-15T19:40:14.862914Z"
    }
   },
   "outputs": [],
   "source": [
    "diabetes.isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:40:35.939348Z",
     "start_time": "2019-12-15T19:40:35.862913Z"
    }
   },
   "outputs": [],
   "source": [
    "print(\"Size before dropping\", diabetes.shape, \"\\n\")\n",
    "\n",
    "# Keep only the columns / rows in which less than 40% of the values are NaN\n",
    "col_dropped = diabetes.loc[:, diabetes.isnull().mean() < .4]\n",
    "row_dropped = diabetes.loc[diabetes.isnull().mean(axis=1) < .4, :]\n",
    "\n",
    "print(col_dropped.isnull().sum())\n",
    "print(\"\\nSize after dropping sparse columns\", col_dropped.shape)\n",
    "print(\"Size after dropping sparse rows\", row_dropped.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Imputing missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "#### Missing values can be imputed in several ways. Here are the three most common:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "#### 1. A constant value regarded as \"normal\" in the domain\n",
    "#### 2. A summary statistic such as the mean, median or mode\n",
    "#### 3. A value estimated by an algorithm or predictive model"
   ]
  },
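  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "The three options above can be sketched on a tiny example. This is only an illustration under stated assumptions: the `toy` frame, the placeholder constant `25.0` and the straight-line model are hypothetical and are not part of the diabetes data or of the original notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data (not the diabetes dataset): 'bmi' has two missing values\n",
    "toy = pd.DataFrame({'age': [20.0, 30.0, 40.0, 50.0, 60.0],\n",
    "                    'bmi': [22.0, np.nan, 26.0, np.nan, 30.0]})\n",
    "\n",
    "# 1. Constant regarded as typical in the domain (25.0 is an assumed placeholder)\n",
    "constant_filled = toy['bmi'].fillna(25.0)\n",
    "\n",
    "# 2. Summary statistic of the observed values (here the median)\n",
    "median_filled = toy['bmi'].fillna(toy['bmi'].median())\n",
    "\n",
    "# 3. Model-based estimate: fit bmi = slope * age + intercept on the complete rows,\n",
    "#    then use the fitted line only where 'bmi' is missing\n",
    "known = toy.dropna()\n",
    "slope, intercept = np.polyfit(known['age'], known['bmi'], deg=1)\n",
    "model_filled = toy['bmi'].fillna(slope * toy['age'] + intercept)\n",
    "```\n",
    "\n",
    "A constant or a summary statistic is quick but ignores the other columns; a model-based estimate uses them, at the cost of extra assumptions about how the columns are related."
   ]
  },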
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Binning (grouping data into classes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:44:26.903778Z",
     "start_time": "2019-12-15T19:44:26.813935Z"
    }
   },
   "outputs": [],
   "source": [
    "# Illustrative cut-offs chosen for this demo, not clinical BMI categories\n",
    "bins = [0,25,30,35,40,100]\n",
    "\n",
    "group_names = ['Malnutrition', 'Under-Weight', 'Healthy', 'Over-Weight', 'Obese']\n",
    "diabetes['BMI_Class'] = pd.cut(diabetes['BMI'], bins, labels=group_names)\n",
    "diabetes.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diabetes.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:44:41.437755Z",
     "start_time": "2019-12-15T19:44:41.384777Z"
    }
   },
   "outputs": [],
   "source": [
    "# Impute the values (assign back rather than using inplace=True on a column selection):\n",
    "diabetes['BMI_Class'] = diabetes['BMI_Class'].fillna(diabetes['BMI_Class'].mode()[0])\n",
    "diabetes['Insulin'] = diabetes['Insulin'].fillna(diabetes['Insulin'].mean())\n",
    "diabetes['Pregnancies'] = diabetes['Pregnancies'].fillna(diabetes['Pregnancies'].median())\n",
    "\n",
    "# Check the missing values again; columns that were not imputed still contain NaN\n",
    "print(diabetes.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### **Scaling**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:45:14.080132Z",
     "start_time": "2019-12-15T19:45:14.048222Z"
    }
   },
   "outputs": [],
   "source": [
    "vector = np.random.chisquare(1,500)\n",
    "print(\"Mean\",np.mean(vector))\n",
    "print(\"SD\",np.std(vector))\n",
    "print(\"Range\",max(vector)-min(vector))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:45:19.400394Z",
     "start_time": "2019-12-15T19:45:18.892582Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "range_scaler = MinMaxScaler()\n",
    "range_scaler.fit(vector.reshape(-1,1))\n",
    "range_scaled_vector = range_scaler.transform(vector.reshape(-1,1))\n",
    "print(\"Mean\",np.mean(range_scaled_vector))\n",
    "print(\"SD\",np.std(range_scaled_vector))\n",
    "print(\"Range\",max(range_scaled_vector)-min(range_scaled_vector))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Standardization (z-score normalization)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T19:45:38.187703Z",
     "start_time": "2019-12-15T19:45:38.134032Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "standardizer = StandardScaler()\n",
    "standardizer.fit(vector.reshape(-1,1))\n",
    "std_scaled_vector = standardizer.transform(vector.reshape(-1,1))\n",
    "print(\"Mean\",round(float(np.mean(std_scaled_vector)), 6))\n",
    "print(\"SD\",round(float(np.std(std_scaled_vector)), 6))\n",
    "print(\"Range\",max(std_scaled_vector)-min(std_scaled_vector))"
   ]
  },
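  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "What the two scalers above compute with their default settings can be written out by hand. The sketch below uses plain NumPy on a small hypothetical array `x` (an assumption for illustration, not data from this notebook).\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical sample values\n",
    "\n",
    "# Min-max scaling maps the smallest value to 0 and the largest to 1\n",
    "min_max = (x - x.min()) / (x.max() - x.min())\n",
    "\n",
    "# Standardization subtracts the mean and divides by the population\n",
    "# standard deviation (NumPy's default, ddof=0)\n",
    "z_score = (x - x.mean()) / x.std()\n",
    "\n",
    "print('Min-max:', min_max)\n",
    "print('Z-score mean and SD:', z_score.mean(), z_score.std())\n",
    "```\n",
    "\n",
    "Min-max scaling fixes the range but is sensitive to extreme values, whereas standardization fixes the mean and standard deviation and leaves the range free."
   ]
  },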
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Dummification (one-hot encoding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:02:52.855342Z",
     "start_time": "2019-12-15T20:02:52.748305Z"
    }
   },
   "outputs": [],
   "source": [
    "dummified_data = pd.concat([diabetes.iloc[:,:-1],pd.get_dummies(diabetes['BMI_Class'])],axis=1)\n",
    "dummified_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Reshape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:03:07.733297Z",
     "start_time": "2019-12-15T20:03:07.711623Z"
    }
   },
   "outputs": [],
   "source": [
    "vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:03:12.174131Z",
     "start_time": "2019-12-15T20:03:12.152417Z"
    }
   },
   "outputs": [],
   "source": [
    "col_vector = vector.reshape(-1,1)  # shape (500, 1): a single column\n",
    "col_vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:03:15.145229Z",
     "start_time": "2019-12-15T20:03:15.101556Z"
    }
   },
   "outputs": [],
   "source": [
    "row_vector = vector.reshape(1,-1)  # shape (1, 500): a single row\n",
    "row_vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:03:18.053346Z",
     "start_time": "2019-12-15T20:03:18.029424Z"
    }
   },
   "outputs": [],
   "source": [
    "matrix = vector.reshape(10,50)\n",
    "matrix.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Pivot Table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:03:41.364995Z",
     "start_time": "2019-12-15T20:03:41.317247Z"
    }
   },
   "outputs": [],
   "source": [
    "#Determine pivot table\n",
    "df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n",
    "                       \"bar\", \"bar\", \"bar\", \"bar\"],\n",
    "                    \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n",
    "                          \"one\", \"one\", \"two\", \"two\"],\n",
    "                    \"C\": [\"small\", \"large\", \"large\", \"small\",\n",
    "                          \"small\", \"large\", \"small\", \"small\",\n",
    "                          \"large\"],\n",
    "                    \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n",
    "                    \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:04:13.246568Z",
     "start_time": "2019-12-15T20:04:13.191943Z"
    }
   },
   "outputs": [],
   "source": [
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:04:14.783329Z",
     "start_time": "2019-12-15T20:04:14.697076Z"
    }
   },
   "outputs": [],
   "source": [
    "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n",
    "                       columns=['C'], aggfunc='sum')\n",
    "table\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Crosstab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-12-15T20:04:20.965978Z",
     "start_time": "2019-12-15T20:04:20.791659Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.crosstab(df[\"D\"],df[\"B\"],margins=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Merging dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n",
    "                        'B': ['B0', 'B1', 'B2', 'B3'],\n",
    "                        'C': ['C0', 'C1', 'C2', 'C3'],\n",
    "                        'D': ['D0', 'D1', 'D2', 'D3']},\n",
    "                       index=[0, 1, 2, 3])\n",
    "    \n",
    "\n",
    "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n",
    "                        'B': ['B4', 'B5', 'B6', 'B7'],\n",
    "                        'C': ['C4', 'C5', 'C6', 'C7'],\n",
    "                        'D': ['D4', 'D5', 'D6', 'D7']},\n",
    "                       index=[4, 5, 6, 7])\n",
    "    \n",
    "\n",
    "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n",
    "                       'B': ['B8', 'B9', 'B10', 'B11'],\n",
    "                        'C': ['C8', 'C9', 'C10', 'C11'],\n",
    "                        'D': ['D8', 'D9', 'D10', 'D11']},\n",
    "                       index=[8, 9, 10, 11])\n",
    "    \n",
    "\n",
    "frames = [df1, df2, df3]\n",
    "\n",
    "result = pd.concat(frames)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "### Plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grade = pd.read_csv(\"../input/grades/Grade.csv\")\n",
    "grade.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grade.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "sns.regplot(x=\"StudentId\",y=\"OverallPct1\",data=grade,scatter=True,fit_reg=False)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "sns.boxplot(data=grade.iloc[:,1:-1],orient='h',palette=\"Set2\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
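    "# Basic plots (histplot replaces the deprecated distplot)\n",
    "fig, axs = plt.subplots(ncols=3)\n",
    "sns.histplot(x=grade['English1'], kde=True, stat='density', ax=axs[0])\n",
    "sns.histplot(x=grade['Math1'], kde=True, stat='density', ax=axs[1])\n",
    "sns.histplot(x=grade['Science1'], kde=True, stat='density', ax=axs[2])\n",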
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Axis range and granularity are very important\n",
    "fig, axs = plt.subplots(ncols=3)\n",
    "sns.histplot(x=grade['English1'], kde=True, stat='density', bins=20, ax=axs[0])\n",
    "sns.histplot(x=grade['Math1'], kde=True, stat='density', bins=20, ax=axs[1])\n",
    "sns.histplot(x=grade['Science1'], kde=True, stat='density', bins=20, ax=axs[2])\n",
    "axs[0].set_xlim(0,100)\n",
    "axs[1].set_xlim(0,100)\n",
    "axs[2].set_xlim(0,100)\n",
    "axs[0].set_ylim(0,0.12)\n",
    "axs[1].set_ylim(0,0.12)\n",
    "axs[2].set_ylim(0,0.12)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aesthetics (histplot, kdeplot and rugplot replace the deprecated distplot)\n",
    "fig, axs = plt.subplots()\n",
    "sns.histplot(x=grade['English1'], bins=10, stat='density', element='step', fill=False, color='green', linewidth=3)\n",
    "sns.kdeplot(x=grade['English1'], color='black', lw=2, label='KDE')\n",
    "sns.rugplot(x=grade['English1'], color='red')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "normal_dist = np.random.randn(1000)\n",
    "normal_plot = sns.histplot(x=normal_dist, kde=True, stat='density')\n",
    "normal_plot.set(xlabel='Value', ylabel='Density')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots()\n",
    "vector = np.random.chisquare(1,500)\n",
    "vector_plot = sns.histplot(x=vector, kde=True, stat='density')\n",
    "vector_plot.set(xlabel='Value', ylabel='Density')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = pd.read_csv(\"../input/iris-dataset/Iris.csv\")\n",
    "iris.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(nrows=2)\n",
    "sns.swarmplot(x=\"Species\", y=\"SepalLengthCm\",ax=axs[0], data=iris)\n",
    "sns.swarmplot(x=\"Species\", y=\"PetalLengthCm\",ax=axs[1], data=iris)\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(12, 8))\n",
    "# One scatter call per species, each with its own marker, colour and legend entry\n",
    "styles = [('Iris-setosa', '^', 'green'), ('Iris-versicolor', '^', 'orange'), ('Iris-virginica', 'o', 'blue')]\n",
    "for species, marker, colour in styles:\n",
    "    subset = iris[iris['Species'] == species]\n",
    "    plt.scatter(subset['PetalLengthCm'], subset['PetalWidthCm'], marker=marker, c=colour, label=species)\n",
    "plt.xlabel('PetalLengthCm')\n",
    "plt.ylabel('PetalWidthCm')\n",
    "plt.legend(loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "**For more plots and exploratory data analysis, please refer to my kernel here.**\n",
    "\n",
    "https://www.kaggle.com/shravankoninti/starter-code-and-eda-on-iris-species"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
